11.2 - Advanced Concepts to Explore - HRL, MARL
The monolithic, single-agent architectures we have implemented are powerful but have inherent limitations in solving problems that require both long-term strategy and fine-grained, parallel control. To address this, the field of reinforcement learning offers more advanced architectural paradigms.
This section provides a high-level overview of two of the most important of these: Hierarchical Reinforcement Learning (HRL) and Multi-Agent Reinforcement Learning (MARL).
Important Disclaimer - Beyond stable-baselines3
The standard stable-baselines3 library is a single-agent framework. It is not designed to handle the multi-policy systems required for HRL or the decentralized nature of MARL out of the box. Implementing these paradigms requires specialized libraries (such as ray[rllib]) or significant custom development. The purpose of this section is to introduce the concepts so you can recognize when they are the appropriate next step in your research.
Hierarchical Reinforcement Learning (HRL)
HRL addresses the challenge of long-term credit assignment by decomposing a single, complex policy into a hierarchy of policies, each operating at a different level of temporal abstraction.
The Architectural Pattern - Manager and Worker
- Manager (High-Level Policy): Observes the global game state and selects a high-level goal (e.g., "expand now," "attack enemy base"). It operates on a slow timescale.
- Worker (Low-Level Policy): Receives the current state and the manager's current goal. Its job is to output the low-level actions (move, attack, build) necessary to achieve that specific goal.
Information Flow:
+--------------------------+
|       Global State       |
+------------+-------------+
             |
             v
+--------------------------+  (Selects Goal)  +----------------+
|      Manager Policy      |----------------->|      Goal      |
|   (Outputs a sub-goal)   |                  |("Attack Base") |
+--------------------------+                  +-------+--------+
                                                      |
                                                      v
+--------------------------------------------------------------+
|                        Worker Policy                         |
|      (Receives State + Goal, outputs low-level actions)      |
+--------------------------------------------------------------+
| HRL Solves | HRL Introduces a Challenge |
| --- | --- |
| Long-Term Planning: The manager only needs to learn high-level strategy, a much simpler problem than micro-managing every unit. | Goal and Reward Definition: Defining a useful set of abstract sub-goals and designing the reward functions to train the worker policy are non-trivial engineering tasks. |
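To make the worker side of this pattern concrete, here is a minimal sketch built on the generic gymnasium wrapper API. The goal names, the one-hot goal encoding, and the idea of a manager calling set_goal() every N worker steps are illustrative assumptions for this sketch, not part of any SC2 or stable-baselines3 interface.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class GoalConditionedWorkerEnv(gym.Wrapper):
    """Exposes the low-level environment to the worker policy with the
    manager's current sub-goal appended to the observation as a one-hot
    vector. Assumes a flat Box observation space; goal names and reward
    handling are placeholders."""

    GOALS = ["expand", "attack_base", "defend"]  # hypothetical sub-goals

    def __init__(self, env):
        super().__init__(env)
        obs_dim = int(np.prod(env.observation_space.shape))
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(obs_dim + len(self.GOALS),), dtype=np.float32,
        )
        self.current_goal = 0

    def set_goal(self, goal_index):
        # The manager policy would call this every N worker steps.
        self.current_goal = goal_index

    def _augment(self, obs):
        one_hot = np.zeros(len(self.GOALS), dtype=np.float32)
        one_hot[self.current_goal] = 1.0
        flat = np.asarray(obs, dtype=np.float32).ravel()
        return np.concatenate([flat, one_hot])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # A full implementation would replace `reward` here with a
        # goal-specific shaped reward (e.g., progress toward "expand").
        return self._augment(obs), reward, terminated, truncated, info
```

In a full HRL setup, the manager would pick a goal index on its slow timescale (its own "action"), call set_goal(), and be rewarded at that coarser granularity, while the worker trains against the goal-conditioned observations and a goal-specific shaped reward.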
Multi-Agent Reinforcement Learning (MARL)
MARL addresses problems requiring decentralized, parallel control by modeling each entity as its own independent, learning agent.
The Architectural Pattern - A Team of Agents
- Each unit (or squad) is an individual agent with its own policy network.
- Each agent receives a local observation of the environment.
- All agents act concurrently, and effective coordinated behavior emerges from their collective actions.
Information Flow:
                     +-------------------+
                     |    Shared SC2     |
                     |    Environment    |
                     +------+-----+------+
     (obs_1, rew_1)         ^     ^         (obs_2, rew_2)
            |               |     |               |
            v               |     |               v
+-----------+------------+  |     |   +-----------+------------+
|   Agent 1 (Policy 1)   |  |     |   |   Agent 2 (Policy 2)   |
|   (Outputs action_1)   |  |     |   |   (Outputs action_2)   |
+-----------+------------+  |     |   +-----------+------------+
            |               |     |               |
            +-------------->+     +<--------------+
                action_1               action_2
| MARL Solves | MARL Introduces a Challenge |
| --- | --- |
| Fine-Grained Micro: It is a natural fit for complex squad-level combat, allowing for highly reactive and coordinated control. | Non-Stationarity: From the perspective of any one agent, the environment is constantly changing as its teammates' policies evolve, which can make training unstable. This often requires specialized algorithms (e.g., centralized training with decentralized execution). |
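The structural idea can be sketched without any particular framework: one policy per agent, each mapping only its own local observation to an action, with all actions gathered into a single joint step. The agent names, the 8-dimensional observations, and the fixed linear "policies" below are pure placeholders standing in for separately trained networks; nothing here is a real SC2 interface.

```python
import numpy as np

# Hypothetical agent ids; in practice these come from the environment.
AGENT_IDS = ["marine_1", "marine_2", "medivac_1"]

rng = np.random.default_rng(0)

# One independent "policy" per agent: a fixed random linear map from an
# 8-dim local observation to 4 action logits, standing in for a network.
policies = {aid: rng.normal(size=(4, 8)) for aid in AGENT_IDS}


def local_observation(agent_id, global_state):
    # A real implementation would crop/encode the map around this unit;
    # here the "global state" is just a dict of per-agent feature vectors.
    return global_state[agent_id]


def select_action(policy, obs):
    # Greedy action from the agent's own policy and its own observation.
    return int(np.argmax(policy @ obs))


# One decision step: every agent acts concurrently on local information.
global_state = {aid: rng.normal(size=8) for aid in AGENT_IDS}
joint_action = {
    aid: select_action(policies[aid], local_observation(aid, global_state))
    for aid in AGENT_IDS
}
print(joint_action)  # e.g. {'marine_1': 2, 'marine_2': 0, 'medivac_1': 3}
```

Training each of these policies independently is exactly where the non-stationarity problem in the table above comes from: every agent's "environment" includes its teammates, whose behavior keeps changing as they learn.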
Where to Go Next
To implement these advanced concepts, you will need to explore frameworks designed for them:
- For HRL and MARL: Ray RLlib is a powerful, industry-standard library that supports a wide variety of advanced RL paradigms.
- For MARL: PettingZoo is a popular library that provides a gymnasium-like API specifically for multi-agent environments.
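To get a feel for that API, here is a minimal random-agent loop over PettingZoo's parallel interface, using the bundled Pistonball environment purely as a stand-in (a custom multi-agent SC2-style environment would expose the same methods). Note that the exact return signature of reset() has varied across PettingZoo versions; the two-value form below matches recent releases.

```python
# A minimal random-agent loop over PettingZoo's parallel API.
# Pistonball is just a stand-in environment; install with:
#   pip install "pettingzoo[butterfly]"
from pettingzoo.butterfly import pistonball_v6

env = pistonball_v6.parallel_env()
observations, infos = env.reset(seed=42)

while env.agents:
    # Every live agent chooses an action from its own observation;
    # here we sample randomly instead of calling a trained policy.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()
```

Swapping the random sampling for per-agent policies (as in the MARL sketch above), or handing the whole environment to RLlib's multi-agent training facilities, is the natural next step.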